Japanese text segmentation: a comparison of different methods applied to Kanji

Authors

  • Andy Bartholomew
  • John Donaldson
Abstract

Written Japanese and Chinese contain no word delimiters such as spaces, so segmenting text into words is the first step in processing these languages. Over the years, several segmentation methods based on various statistical and grammatical principles have been developed. I have implemented two such methods: the Tango algorithm by Ando and Lee and a hidden Markov model by Papageorgiou. In this paper, I contrast these two methods with the open-source knowledge-based parser Juman 5.1 in their ability to segment Japanese text.
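
To make the idea of statistical segmentation concrete, here is a minimal Python sketch of one way character n-gram counts from unsegmented text can suggest word boundaries. It is a simplified illustration only: it is not the Tango algorithm as published by Ando and Lee, nor the hidden Markov model, nor Juman. The function names (`count_ngrams`, `boundary_score`, `segment`), the threshold value, and the toy corpus are all assumptions made for this example.

```python
from collections import Counter


def count_ngrams(corpus, n):
    """Count every character n-gram of length n in an iterable of unsegmented strings."""
    counts = Counter()
    for line in corpus:
        for i in range(len(line) - n + 1):
            counts[line[i:i + n]] += 1
    return counts


def boundary_score(text, k, counts, n):
    """Score a candidate boundary between text[k-1] and text[k].

    Returns the fraction of comparisons in which an n-gram lying entirely
    on one side of position k is more frequent than an n-gram straddling k.
    """
    # n-grams that end at k (left side) or start at k (right side), if they fit.
    left = text[k - n:k] if k - n >= 0 else None
    right = text[k:k + n] if k + n <= len(text) else None
    # n-grams that cross position k.
    straddling = [
        text[s:s + n]
        for s in range(k - n + 1, k)
        if s >= 0 and s + n <= len(text)
    ]
    votes = total = 0
    for side in (left, right):
        if side is None:
            continue
        for crossing in straddling:
            total += 1
            if counts[side] > counts[crossing]:
                votes += 1
    return votes / total if total else 0.0


def segment(text, counts, n=2, threshold=0.5):
    """Split text wherever the boundary score exceeds the (assumed) threshold."""
    pieces, start = [], 0
    for k in range(1, len(text)):
        if boundary_score(text, k, counts, n) > threshold:
            pieces.append(text[start:k])
            start = k
    pieces.append(text[start:])
    return pieces


# Toy unsegmented "corpus", invented for illustration.
corpus = ["東京", "東京都", "大学", "大学生", "東京大学"]
counts = count_ngrams(corpus, 2)
print(segment("東京大学", counts))
```

On this toy corpus the bigrams 東京 and 大学 are each more frequent than the crossing bigram 京大, so the only boundary placed is between 京 and 大. A realistic system would need a far larger corpus, several n-gram orders, and a tuned threshold.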


Related resources

Mostly-Unsupervised Statistical Segmentation of Japanese: Applications to Kanji

Given the lack of word delimiters in written Japanese, word segmentation is generally considered a crucial first step in processing Japanese texts. Typical Japanese segmentation algorithms rely either on a lexicon and grammar or on pre-segmented data. In contrast, we introduce a novel statistical method utilizing unsegmented training data, with performance on kanji sequences comparable to and s...

Strategies of Processing Japanese Names and Character Variants in Traditional Chinese Text

This paper proposes an approach to identify word candidates that are not Traditional Chinese, including Japanese names (written in Japanese Kanji or Traditional Chinese characters) and word variants, when doing word segmentation on Traditional Chinese text. When handling personal names, a probability model concerning formats of names is introduced. We also propose a method to map Japanese Kanji...

Report TR99-1756 Unsupervised Statistical Segmentation of Japanese Kanji

Word segmentation is an important issue in Japanese language processing because Japanese is written without space delimiters between words. We propose a simple dictionary-less method to segment Japanese kanji sequences into words based solely on character n-gram counts from an unannotated corpus. The performance was often better than that of rule-based morphological analyzers over a variety of ...

TR99-1756 Unsupervised Statistical Segmentation of Japanese Kanji Strings

Word segmentation is an important issue in Japanese language processing because Japanese is written without space delimiters between words. We propose a simple dictionary-less method to segment Japanese kanji sequences into words based solely on character n-gram counts from an unannotated corpus. The performance was often better than that of rule-based morphological analyzers over a variety of ...

The role of interword spacing in reading Japanese: An eye movement study

The present study investigated the role of interword spacing in a naturally unspaced language, Japanese. Eye movements were registered of native Japanese readers reading pure Hiragana (syllabic) and mixed Kanji-Hiragana (ideographic and syllabic) text in spaced and unspaced conditions. Interword spacing facilitated both word identification and eye guidance when reading syllabic script, but not ...



Publication date: 2008